Patient classification and attribute assessment based on machine learning techniques in the qualification process for surgical treatment of adrenal tumours

Adrenal gland incidentaloma is frequently identified through computed tomography and poses a common clinical challenge. Only selected cases require surgical intervention. The primary aim of this study was to compare the effectiveness of selected machine learning (ML) techniques in properly qualifying patients for adrenalectomy and to identify the most accurate algorithm, providing a valuable tool for doctors to simplify their therapeutic decisions. The secondary aim was to assess the significance of attributes for classification accuracy. In total, clinical data were collected from 33 patients who underwent adrenalectomy. Histopathological assessments confirmed the proper selection of 21 patients for surgical intervention according to the guidelines, with accuracy reaching 64%. Statistical analysis showed that Support Vector Machines (linear) were significantly better than the baseline (p < 0.05), with accuracy reaching 91%, and imaging features of the tumour were found to be the most crucial attributes. In summary, ML methods may be helpful in qualifying patients for adrenalectomy.


Study population
From a database of 264 Caucasian patients with adrenal incidentaloma (AI), the clinical data of 33 patients older than 18, who met the criteria for surgical treatment according to the guidelines of the Polish Society of Endocrinology, were used in this retrospective, single-center study 23. Patients had been hospitalized and qualified for an operation in the Department of Endocrinology, Diabetology, and Internal Medicine at the University Clinical Hospital in Białystok between 2017 and 2019. All qualified patients underwent laparoscopic lateral transperitoneal adrenalectomy.
We searched our institutional electronic database and, based on the results of postoperative histopathological examinations, confirmed proper qualification in 21 of the 33 patients selected for operation. Definitive diagnoses were established through histopathology, revealing a study group comprising five cases of pheochromocytoma, two cases of ACC, five cases of Cushing's syndrome, and nine cases of primary hyperaldosteronism. The remaining 12 cases consisted of patients with benign, hormonally inactive lesions, for whom surgical intervention was unnecessary. This study complied with the Declaration of Helsinki and was approved by the Ethical Committee of Białystok (no. APK.002.14.2022). Informed consent for study participation was obtained from all enrolled patients.

Biochemical and radiographic analyses
All patients completed a comprehensive endocrine work-up aimed at studying the hormonal status of AI: aldosterone/renin ratio, 24 h urine collection for metanephrines and normetanephrines, and 1 mg overnight DXM suppression test. Serum cortisol levels after 1 mg DXM > 5 µg/dL confirmed hypercortisolism, whereas serum concentrations of cortisol between 1.9 and 5.0 µg/dL were considered evidence of possible autonomous cortisol secretion. To confirm the diagnosis of CS, the serum concentration of ACTH was measured. The diagnosis of primary aldosteronism was confirmed with a saline infusion test. Hormonal variables were measured in the same laboratory using commercially available kits as previously described 24. Additionally, serum concentrations of sodium and potassium were measured. Every adrenal lesion was assessed with CT as per the following criteria: size, lateralization, tissue density measured in Hounsfield units (HU), and contrast washout values.
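The cortisol cut-offs quoted for the 1 mg DXM suppression test can be written as a small decision function. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and the text does not say how values below 1.9 µg/dL are labelled, so the last branch is an assumption.

```python
def interpret_dxm_cortisol(cortisol_ug_dl):
    """Label a post-DXM serum cortisol value (µg/dL) using the thresholds in the text."""
    if cortisol_ug_dl > 5.0:
        return "hypercortisolism"
    if cortisol_ug_dl >= 1.9:
        return "possible autonomous cortisol secretion"
    # Assumption: the text gives no label for values under 1.9 µg/dL.
    return "below both thresholds"


# Hypothetical example values, not patient data.
labels = [interpret_dxm_cortisol(v) for v in (6.2, 3.0, 1.0)]
```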
CT can be performed with or without contrast enhancement. In our study, lesions with a density of ≤ 10 HU were considered benign. A tumour size > 5 cm is indicative of malignancy and is considered an indication for adrenalectomy 25. In adrenal masses, absolute washout is calculated for lesions with a density of > 10 HU, and it has been confirmed that a value > 60% is indicative of a benign lesion 12. In our study, regular shape, size less than 5.0 cm, density ≤ 10 HU, absolute washout value > 60%, and relative washout > 40% were considered CT evidence of a benign adrenal mass. Abdominal CT was performed in all patients at the Radiology Department of our hospital. Moreover, every patient was screened for obesity, type 2 diabetes mellitus, impaired glucose tolerance (IGT), hyperlipidemia, nodular goiter, Hashimoto disease, Graves' disease, heart failure, atrial fibrillation (AF), ischemic heart disease, renal failure, and hypertension, with particular attention to severe and resistant arterial hypertension, defined according to World Health Organization (WHO) criteria. The data were extracted according to the criteria recommended by the Polish Society of Endocrinology for AI management 23. All extracted data were complete and credible according to medical standards.
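The washout values mentioned above are not defined in the text. The sketch below uses the standard radiological definitions of absolute and relative percentage washout (an assumption about how the values were computed here) and compares them with the 60% and 40% thresholds quoted above. Function names and the example attenuation readings are hypothetical.

```python
def absolute_washout(unenhanced_hu, enhanced_hu, delayed_hu):
    """Absolute percentage washout from unenhanced, enhanced and delayed readings (HU)."""
    return 100.0 * (enhanced_hu - delayed_hu) / (enhanced_hu - unenhanced_hu)


def relative_washout(enhanced_hu, delayed_hu):
    """Relative percentage washout, used when no unenhanced reading is available."""
    return 100.0 * (enhanced_hu - delayed_hu) / enhanced_hu


# Hypothetical readings: 20 HU unenhanced, 100 HU enhanced, 44 HU delayed.
apw = absolute_washout(20.0, 100.0, 44.0)
rpw = relative_washout(100.0, 44.0)

# Thresholds taken from the text: > 60% (absolute) and > 40% (relative).
absolute_suggests_benign = apw > 60.0
relative_suggests_benign = rpw > 40.0
```

The two flags are kept separate because the text lists both thresholds as CT evidence of a benign mass without stating how they are combined.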

Machine learning approach
In our study, we applied selected supervised ML methods with the main stages depicted in Fig. 1. In the preprocessing stage, nominal attributes were converted to numerical values using one-hot encoding, and all attributes were normalized to have the same range. During the experiments, we constructed a feature vector using all the available attributes and performed experiments with reduced feature sets selected using the backwards search method. The feature vectors were passed to the classification algorithm that assigned the subject to one of two classes: qualified or not qualified for adrenalectomy.
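The preprocessing stage can be illustrated with a minimal sketch in plain Python. It assumes that "the same range" means min-max scaling to [0, 1], which the text does not state explicitly. The helper names and the toy records are hypothetical.

```python
def one_hot(values):
    """Map a nominal column to 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories


def min_max(values):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Hypothetical toy records: gender and maximum tumour diameter in cm.
gender = ["female", "male", "female"]
diameter_cm = [2.0, 6.0, 4.0]

gender_encoded, gender_categories = one_hot(gender)
diameter_scaled = min_max(diameter_cm)

# One feature vector per patient: indicator columns followed by scaled values.
feature_vectors = [g + [d] for g, d in zip(gender_encoded, diameter_scaled)]
```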

Feature vector attributes
Each patient in the study had 23 attributes. Eight attributes represented measurements on a ratio scale, and the remaining 15 represented measurements on a nominal scale. All nominal attributes had two values: female/male for gender and the presence or absence of features for other attributes. Table 2 shows all attributes with their scales and the summary of their values. Examples of CT images depicting adrenal tumours, illustrating the attributes used in this study, are presented in Fig. 2. For the attributes on the ratio scale, the median and interquartile range are given. For the nominal attributes, the table contains the case counts for each of the two possible values.

Classifiers
In our study, we used several classifiers, which are briefly described in this section 26,27.
• Zero R: a baseline approach that assigns examples to the majority class in the training set (ignoring attribute values).
• One rule: a classifier that uses only a single attribute for classification and assigns the subject to the majority class with the same attribute value in the training set. If attribute selection is performed based on the accuracy measure, the selected attribute has the highest accuracy in the training set. The algorithm was applied to nominal attributes. The numerical attributes were converted to nominal values using the discretization procedure described in 28 (with a minimum bucket size of 6).
• Naïve Bayes: a classifier based on Bayes' theorem, with the assumption of feature independence. The probability that a given feature vector x belongs to class c_k is given in Eq. (1). The predicted class c can be selected using the maximum a posteriori (MAP) rule (2), where K is the number of classes.
• K-nearest neighbors: classifies the subject based on the plurality vote of its k nearest neighbors, where the neighborship is assessed based on a distance measure applied to examples in the training set. In this study, we used the Euclidean distance.
• Logistic regression with ridge regularization: models the probability that a given feature vector belongs to a particular class. It is based on the assumption that the logarithm of odds (log-odds) can be described using a linear combination of predictor variables, and thus, in the case of two possible decision classes (C = c_1 or C = c_2), the probability of x having class C = c_1 may be computed using formula (3). The vector of the coefficients β is selected to minimize the cost function L (4), where N is the number of examples in the training set, y_i denotes whether sample i belongs to class c_1 (y_i = 1) or not (y_i = 0), and x_i is the feature vector of the i-th sample.
• SVM: a classifier that separates classes with a hyperplane that has the largest margin (distance to the nearest data point). In our case, we used a soft-margin SVM that allows data points to cross the hyperplane, thereby reducing the separation requirement. The soft-margin separating hyperplane is determined by minimizing (5) under the constraints given by (6) and (7). The hyperplane is represented by vector w normal to the plane and scalar b. The value ξ_i captures the margin violation for sample i. A scalar regularization coefficient controls the extent to which the margin violation is acceptable. There are N samples, where x_i denotes the i-th sample feature vector and y_i denotes the class of the sample (1 or −1). To allow for nonlinear separation, the feature vectors x_i can be transformed into another space, usually with more dimensions, where the hyperplane separation will result in nonlinear separation in the original space. The same effect is achieved by a kernel trick that computes the inner product in the transformed space without the explicit transformation of vectors from the original space. Popular types of kernels include linear, polynomial, and Gaussian radial basis function (RBF) kernels.
• C4.5 Decision Tree: a classifier that generates a decision tree based on C4.5. The C4.5 algorithm uses entropy to measure information gain when selecting attributes to split during the tree creation process. The nodes of the tree represent the decision rules, and the leaves represent decisions. We used the J48 implementation of C4.5 in Weka.
• Random Forest: the algorithm creates a set of decision trees 29, each learned using samples from the training set selected randomly with replacement and random subsets of features. The classification decision for a new sample is performed by voting: the decisions (votes) made by trees in the set are counted, and the class with the most votes wins. In this study, the set consisted of 100 trees.
• Artificial Neural Network: in this study, we used a feed-forward multilayer network with a sigmoid activation function in the hidden layers. The network was trained using stochastic gradient descent with momentum. The neural network consisted of three layers (input, hidden, and output), with the number of neurons in the hidden layer equal to the number of attributes and two neurons in the output layer (one for each class).
All the variables in the equations in the manuscript are summarized in Table 3.

Table 2. Attributes used to construct the feature vector. The summary column for attributes in the ratio scale contains the median and interquartile range (given in parentheses). For nominal attributes, the summary column contains the number of subjects within each of the two groups having specific attribute values: male (female) for gender and absent (present) for other nominal attributes.
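For orientation, the expressions cited as Eqs. (1) to (7) in the classifier descriptions can be written in their standard textbook forms. The block below is a sketch under stated assumptions, not a transcription of the authors' equations. It uses the symbols defined in the text (x, c_k, K, M, β, r, w, b, ξ_i, N, y_i). The symbol λ for the SVM regularization coefficient is an assumption, since no symbol for it appears in the text, and the exact scaling of the ridge term is likewise assumed.

```latex
\begin{align}
p(C = c_k \mid X = x) &= \frac{p(C = c_k)\,\prod_{j=1}^{M} p\bigl(x^{(j)} \mid C = c_k\bigr)}{p(X = x)} \tag{1}\\
\hat{c} &= \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} \; p(C = c_k)\prod_{j=1}^{M} p\bigl(x^{(j)} \mid C = c_k\bigr) \tag{2}\\
p(C = c_1 \mid X = x) &= \frac{1}{1 + e^{-\beta^{T} x}} \tag{3}\\
L(\beta) &= -\sum_{i=1}^{N}\Bigl[\,y_i \ln p(C = c_1 \mid x_i) + (1 - y_i)\ln\bigl(1 - p(C = c_1 \mid x_i)\bigr)\Bigr] + r\,\lVert \beta \rVert^{2} \tag{4}\\
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w \rVert^{2} + \lambda \sum_{i=1}^{N} \xi_i \tag{5}\\
\text{subject to}\quad & y_i\,\bigl(w^{T} x_i + b\bigr) \ge 1 - \xi_i, \tag{6}\\
& \xi_i \ge 0, \qquad i = 1, \dots, N. \tag{7}
\end{align}
```

In (1) the product over features is where the independence assumption enters. In (5) a larger λ penalizes margin violations more strongly, moving the soft-margin solution towards the hard-margin one.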

Results
In this section, we present the results of the experiments conducted in this study. In all experiments, we used algorithms implemented in the Weka software package 30. During the first experiment, we evaluated 11 classifiers applied to the full attribute set, as shown in Table 4. The results were obtained using a tenfold stratified cross-validation scheme repeated 100 times with random reordering of the samples. Consequently, each classifier was trained and evaluated 1000 times on various datasets split into training (90%) and test (10%) subsets. Table 4 presents the average accuracy with standard deviations (SD) computed for the evaluations per classifier.
The numbers of patients qualified correctly and incorrectly differed, so the dataset was unbalanced with respect to the class attribute. Hence, the accuracy of the Zero-R classifier was determined to establish a baseline for further comparisons (Zero-R assigns the example to the most common class in the training set). Statistical analysis of the results, performed with the paired t test modified to account for using the same dataset multiple times with random reordering, showed that all methods except four (one rule, logistic regression, SVM with RBF kernel, C4.5 Decision Tree) were significantly better than the baseline (p < 0.05). As seen in Table 4, the best result of 91% was obtained for the soft-margin SVM with a linear kernel. The K-nearest neighbors classifier (with k = 1) gave the second-best result of 85%, followed by random forest with 84%. These results indicate that the application of ML methods may improve the decision-making process.
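The evaluation scheme (tenfold stratified cross-validation, repeated 100 times with reshuffling, giving 1000 train/test splits) can be sketched in plain Python. This is an illustration of the splitting logic only, not the Weka implementation. The function names, the seed, and the round-robin way of dealing samples into folds are assumptions.

```python
import random


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)  # random reordering within each class
        for index in indices:
            folds[position % k].append(index)  # deal round-robin across folds
            position += 1
    return folds


def repeated_splits(labels, k=10, repeats=100, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        for test in stratified_folds(labels, k, rng):
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, test


# Class sizes from the study: 21 correctly qualified and 12 other patients.
labels = [1] * 21 + [0] * 12
splits = list(repeated_splits(labels))
```

With 33 samples, each test fold holds three or four patients, so a single misclassification moves the per-fold accuracy by 25 to 33 percentage points. This is why the average over many repetitions is reported.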
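The text describes the significance test only as a paired t test modified for repeated reuse of the same data. One widely used correction of this kind is the corrected resampled t test of Nadeau and Bengio. That this exact variant was applied here is an assumption, and the per-split accuracy differences in the example are hypothetical. The sketch also shows why the Zero-R baseline is expected to sit near 64%: it always predicts the majority class, which covers 21 of the 33 patients.

```python
import math
import statistics


def corrected_resampled_t(differences, n_train, n_test):
    """t statistic for paired accuracy differences with the variance correction.

    The usual 1/runs factor is replaced by (1/runs + n_test/n_train) to account
    for the overlap between training sets drawn from the same data.
    """
    runs = len(differences)
    mean_diff = statistics.mean(differences)
    variance = statistics.variance(differences)  # sample variance (runs - 1)
    return mean_diff / math.sqrt((1.0 / runs + n_test / n_train) * variance)


def uncorrected_t(differences):
    """Standard paired t statistic, shown only for comparison."""
    runs = len(differences)
    return statistics.mean(differences) / math.sqrt(statistics.variance(differences) / runs)


# Hypothetical accuracy differences (classifier minus baseline) over four splits.
differences = [0.2, 0.4, 0.3, 0.1]
t_corrected = corrected_resampled_t(differences, n_train=30, n_test=3)
t_plain = uncorrected_t(differences)

# Share of the majority class, i.e. the accuracy Zero-R is expected to approach.
majority_share = 21 / 33
```

The corrected statistic is always smaller in magnitude than the uncorrected one, so the correction makes it harder to declare a difference significant.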

Experiment 2
To evaluate the importance of attributes for classification accuracy, we applied the wrapper method with the backwards best-first search method, with search termination after five nonimproving nodes 31. Attribute selection was performed on the training subset obtained from the cross-validation split. After the attribute selection, the classifier was trained and evaluated on the test subset of the cross-validation split. The procedure was performed using a tenfold cross-validation scheme and repeated five times with random reordering of the samples. Table 5 shows the percentage of times each attribute was selected; attributes that were selected more frequently were better (more stable) indicators for issuing correct decisions. The most frequently selected attributes were tumour homogeneity (100%), maximum tumour diameter (98%), and obesity (100%). For the classifier, we used an SVM with a linear kernel that gave the best results in Experiment 1.

Table 3. Summary of the variables used in the equations presented in the manuscript.
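The idea behind a wrapper with backward search can be shown with a deliberately simplified sketch. It uses plain greedy backward elimination rather than the best-first search with five non-improving nodes used in the study. The `score` callable stands in for the cross-validated accuracy of the classifier on the training subset. The toy scoring function and the attribute names are hypothetical.

```python
def backward_elimination(features, score):
    """Greedily drop one feature at a time while doing so improves the score."""
    selected = list(features)
    best = score(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for feature in list(selected):
            candidate = [f for f in selected if f != feature]
            candidate_score = score(candidate)
            if candidate_score > best:
                best, selected, improved = candidate_score, candidate, True
                break  # restart the scan from the reduced set
    return selected, best


def toy_score(subset):
    """Hypothetical stand-in for classifier accuracy on a feature subset."""
    useful = {"homogeneity", "max_diameter"}
    hits = sum(1 for f in subset if f in useful)
    noise = sum(1 for f in subset if f not in useful)
    return hits - 0.1 * noise


selected, best = backward_elimination(
    ["homogeneity", "max_diameter", "sodium", "potassium"], toy_score
)
```

In a real wrapper the selection must run inside each training split, as the text describes, so that the held-out test subset never influences which attributes are kept.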

Experiment 3
In this experiment, we applied the attribute selection method from Experiment 2 combined with selected classifiers and evaluated the performance of the classifiers used on the reduced attribute set. The results were obtained with a tenfold cross-validation scheme repeated 100 times with random reordering of samples. The same classifier was used for attribute selection and classification processes. As seen in Table 6, prior attribute selection using the wrapper method did not lead to better accuracy of most trained classifiers; only in the case of K-nearest neighbors (k = 3) and C4.5 was a slight improvement observed.

Discussion
The decision to qualify a patient for surgery is not always correct, as verified by histopathological examination. In this study, correct qualification was confirmed in only 21 of the 33 selected patients. This highlights the significant problems that arise in personalized management of AI and the need for better diagnostic tools. We demonstrated the usefulness of ML predictive algorithms based on existing data for reliable automated and preoperative classification of AI. ML was found to enable a reasonable level of accuracy in qualifying patients for adrenalectomy. The results of this study seem to show that artificial intelligence can detect patterns that may help in making the correct decision. In developing our manuscript, we followed the requirements for ensuring the high quality and usefulness of medical ML studies 32.

Table 5. The percentage of times each attribute was selected using the wrapper method with a backwards search for SVM with a linear kernel in a tenfold cross-validation scheme.

The results of Experiment 1, obtained in a group of patients who met the criteria for surgery, are promising: 91% of decisions were correct for the linear SVM classifier versus 64% for medical specialists. It should be mentioned that this is a preliminary study with a relatively small dataset. Enlarging the set would allow the use of more complex classifiers, such as larger neural networks, that may lead to even better results. In this study, 23 attributes were used. Nevertheless, subsequent studies provide new diagnostic tools for patients with AI; for example, the EURINE-ACT study presented a triple test with urine steroid metabolomics, imaging characteristics, and tumour diameter to improve the detection of ACC 33. Hence, the application of ML techniques in the qualification for the surgical treatment of adrenal tumours could be improved in the future by involving more characteristics.
In Experiment 2, the attribute selection method was used to investigate the attributes that were most relevant to the correctness of the classification. The results obtained were consistent with expert knowledge: imaging features of the tumour, such as homogeneity and size, were found to be the most important. Additionally, 24-h urine collection for normetanephrines, 24-h urine collection for metanephrines, the suppression test with 1 mg of DXM, and the aldosterone/renin ratio were also indicated as very important factors. Interestingly, obesity was also important. In further investigation, in the case of decision trees, the obtained rule suggested that with a homogeneous tumour image, the patient's obesity significantly increased the chance of a pathological lesion. However, an attempt to reduce the set of attributes in Experiment 3 using the selection method from Experiment 2 did not improve the classification accuracy. This may indicate that it is difficult to establish a simple rule that achieves high decision accuracy using only a few factors, and that most of the selected data may be relevant for decision-making.
In our work, we tuned the classifier hyperparameters using linear and grid search methods with an internal cross-validation split on the training set. However, probably due to the limited size of our dataset, the search did not lead to a significant improvement over the default parameter values proposed by the authors of the Weka software package. As an alternative, our future plans include the application of swarm methods for hyperparameter tuning and for feature selection 34-36.
This study has several limitations. One of them is the small sample size. Thus, validation of these results in a large and well-balanced study population is necessary before clinical application. A larger number of patients with histopathologically confirmed tumours would have improved the accuracy of our results. Another constraint is the retrospective nature of the study and its inherent limitations. Similar limitations have been repeatedly mentioned in the studies presented in Table 1. Comparing the accuracy of our study with that of other studies is difficult because they have different designs and do not consider the same factors. In our work, the best accuracy was obtained for the SVM classifier (90.98%) as an average over 1000 iterations of the learning process. It should be noted that the accuracy was determined on the test set, which was used neither in the selection of features nor in the learning process; therefore, the presented accuracy values are unbiased estimates. In other studies, such as Yi's research, there was no separation between the training and test sets 10. Another important point is that, in our study, selected ML techniques (including the best performing linear SVM) achieved a statistically significant advantage in accuracy over patient qualification performed by medical personnel.
Nonetheless, a significant strength of our study lies in its pioneering nature.It is the first study to incorporate both imaging and hormonal test results in ML techniques, encompassing the full spectrum of lesions qualifying for surgical treatment.Despite its limitations, especially its limited accuracy, our study provides valuable insights that lay the groundwork for further research in this field.Future studies with larger and more diverse cohorts, along with prospective designs, are essential to validate and extend our findings for clinical application.

Conclusions
ML-based methods could be used as an accurate diagnostic device to help avoid unnecessary surgeries in patients with benign and non-functional adrenal masses.However, our results have not been adopted in daily practice thus far, and further studies are needed to investigate the application of other attributes in the decision-making process and the extension of the training database.

Figure 2. Example CT images of adrenal tumours showing the attributes used in this study: (a) maximal diameter for a tumour with homogeneity feature absent and lateralization feature present, (b) minimal diameter for a tumour with homogeneity feature absent and lateralization feature present, (c) tumour with lateralization feature absent, (d) tumour with homogeneity feature present.
c_k: The value of C for the sample of the k-th class
X: Random variable representing the feature vector of a sample
M: Number of features
x: Feature vector representing a sample, x = [1, x^(1), x^(2), ..., x^(M)]^T
y: Scalar value representing the class of a sample
K: Number of classes
p(A): Probability of event A
p(A|B): Conditional probability of event A given that event B has occurred
β: Vector of coefficients in logistic regression, β = [1, β^(1), β^(2), ..., β^(M)]^T
r: Ridge regularization scalar coefficient in logistic regression
w: Normal vector defining the SVM hyperplane, w = [w^(1), w^(2), ..., w^(M)]^T
ξ_i: Scalar value controlling the margin violation constraint in SVM for the i-th sample
Regularization scalar coefficient in SVM

Table 1. Summary of studies looking at the application of ML techniques in AI management.

Table 4. Percent of properly classified subjects using all attributes.

Table 6. Percent of properly classified subjects with prior attribute selection.